A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining
نویسندگان
چکیده
We propose a novel model to automatically extract transliteration pairs from parallel corpora. Our model is efficient, language pair independent and mines transliteration pairs in a consistent fashion in both unsupervised and semi-supervised settings. We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models. We evaluate on NEWS 2010 shared task data and on parallel corpora with competitive results.
منابع مشابه
Statistical models for unsupervised, semi-supervised and supervised transliteration mining
We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three different settings, unsupervised, semi-supervised and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e. noise). The model is trained on noisy unlabelled da...
متن کاملSemi-supervised transliteration mining from parallel and comparable corpora
Transliteration is the process of writing a word (mainly proper noun) from one language in the alphabet of another language. This process requires mapping the pronunciation of the word from the source language to the closest possible pronunciation in the target language. In this paper we introduce a new semi-supervised transliteration mining method for parallel and comparable corpora. The metho...
متن کاملQCRI-MES Submission at WMT13: Using Transliteration Mining to Improve Statistical Machine Translation
This paper describes QCRI-MES’s submission on the English-Russian dataset to the Eighth Workshop on Statistical Machine Translation. We generate improved word alignment of the training data by incorporating an unsupervised transliteration mining module to GIZA++ and build a phrase-based machine translation system. For tuning, we use a variation of PRO which provides better weights by optimizing...
متن کاملStatistical machine learning for data mining and collaborative multimedia retrieval
of thesis entitled: Statistical Machine Learning for Data Mining and Collaborative Multimedia Retrieval Submitted by HOI, Chu Hong (Steven) for the degree of Doctor of Philosophy at The Chinese University of Hong Kong in September 2006 Statistical machine learning techniques have been widely applied in data mining and multimedia information retrieval. While traditional methods, such as supervis...
متن کاملSemi-Supervised Lexicon Mining from Parenthetical Expressions in Monolingual Web Pages
This paper presents a semi-supervised learning framework for mining Chinese-English lexicons from large amount of Chinese Web pages. The issue is motivated by the observation that many Chinese neologisms are accompanied by their English translations in the form of parenthesis. We classify parenthetical translations into bilingual abbreviations, transliterations, and translations. A frequency-ba...
متن کامل